Abstract

The visual layout of a webpage can provide valuable clues for certain types 
of Information Extraction (IE) tasks. In traditional rule based IE frameworks, 
these layout cues are mapped to rules that operate on the HTML source of the 
webpages. In contrast, we have developed a framework in which the rules can be 
specified directly at the layout level. This has many advantages, since the higher 
level of abstraction leads to simpler extraction rules that are largely independent 
of the source code of the page, and, therefore, more robust. It can also enable 
specification of new types of rules that are not otherwise possible. To the best of 
our knowledge, there is no general framework that allows declarative specification 
of information extraction rules based on spatial layout. Our framework is 
complementary to traditional text based rules framework and allows a seamless 
combination of spatial layout based rules with traditional text based rules. We 
describe the algebra that enables such a system and its efficient implementation 
using standard relational and text indexing features of a relational database. 
We demonstrate the simplicity and efficiency of this system for a task involving
the extraction of software system requirements from software product pages. 


1 Introduction 

Information in web pages is laid out in a way that aids human perception using 
specification languages that can be understood by a web browser, such as HTML, CSS, 
and Javascript. The visual layout of elements in a page contains valuable clues that
can be used for extracting information from the page. Indeed, there have been several 
efforts to use layout information for specific tasks such as web page segmentation 
and table extraction [^. There are two ways to use layout information: 

1. Source Based Approach: Map the layout rule to equivalent rules based on
the source code (HTML) of the page. For example, alignment of elements can be
achieved in HTML by using a list (<li>) or a table row (<tr>) tag.

2. Layout Based Approach: Use the layout information (coordinates) of various
elements obtained by rendering the page to extract relevant information. 

Both these approaches achieve the same end result, but the implementations are 
different as illustrated in the example below. 

Example 1.1 Figure 1 shows the system requirements page for an IBM software
product. The IE task is to extract the set of operating systems supported by the product
(listed in a column in the table indicated by Q3). In the source based approach, the rules
need to identify the table, its rows and columns, the row or column containing the words
‘Operating Systems’, and finally a list of entities, all based on the tags that can be used
to implement them. In the layout based approach, the rule can be stated as: ‘From each
System Requirements page, extract a list of operating system names that appear strictly^1
to the south of the words ‘Operating Systems’ and are vertically aligned’. The higher level
layout based rule is simpler, and is more robust to future changes in these web pages.



Figure 1: System requirements page 


Source based rules have several serious limitations, as listed below: 


• An abstract visual pattern can be implemented in many different ways by the 
web designer. For example, a tabular structure can be implemented using any of 
<table>, <div> and <li> tags. Lerman et al m show that only a fraction 
of tables are implemented using the <table> tag. Source-based rules that use 
layout cues need to cover all possible ways in which the layout can be achieved. 
Our experience with large scale IE tasks suggests that rules that depend on HTML
tags and DOM trees work reasonably well on template based machine-generated
pages, but become too complex and brittle to maintain when applied to manually
authored web pages.

^1 See Section 3.3 for a definition of strictness.

• Proximity of two entities in the HTML source code does not necessarily imply 
visual proximity [9], and so it may not be possible to encode visual proximity 
cues using simple source based rules. 

• Specification languages are becoming more complex and difficult to analyze. 
Visualization logic is often embedded in CSS and Javascript, making the process 
of rule writing difficult. 

• Rules based on HTML tags and DOM trees are often sensitive to even minor 
modifications of the web page, and rule maintenance becomes messy. 

Layout based approaches overcome these limitations since they are at a higher level
and independent of the page source code. Previous efforts at using layout based
approaches were targeted at specific tasks such as page segmentation, wrapper extraction,
table extraction, etc., and were implemented using custom code. Existing rule based
information extraction frameworks do not provide a mechanism to express rules based on the
visual layout of a page. Our goal is to address this gap by augmenting a rule based
information extraction framework to be able to express layout based rules. Rule based
systems can be either declarative [15][13] or procedural [T]. It has been shown that
expressing information extraction (IE) tasks using an algebra, rather than procedural rules
or custom code, enables systematic optimizations making the extraction process very
efficient [H ng. Hence, we focus on an algebraic information extraction framework described
in [15] and extend its algebra with a visual operator algebra that can express rules based on
spatial layout cues. One of the challenges is that not all rules can be expressed using layout
cues alone. For some rules, it may be necessary to use traditional text-based matching
such as regular expressions and dictionaries, and combine them with spatial layout based
rules. The framework thus needs to support rules that use both traditional textual
matching and high-level spatial layout cues. In summary, our contributions are as follows:

• We have developed an algebraic framework for rule-based information extraction 
that allows us to seamlessly combine traditional text-based rules with high-level 
rules based on the spatial layout of the page by extending an existing algebra 
for traditional text based information extraction [T2], with a visual operator 
algebra. We would like to reiterate that our focus is not on developing spatial 
rules for a specific task, rather we want to develop an algebra using which spatial 
rules for many different tasks can be expressed. 

• We implement the system using a relational database and demonstrate how the 
algebra enables optimizations by systematically mapping the algebra expressions 
to SQL. Thus, the system can benefit from the indexing and optimization features 
provided by relational databases. 

• We demonstrate the simplicity of the visual rules compared to source based rules 
for the tasks we considered. We also conduct performance studies on a dataset 
with about 20 million regions and describe our experience with the optimizations 
using region and text indices. 

2 Related Work


Information Extraction (IE): IE is a mature area of research that has received
widespread attention in the NLP, AI, web and database communities [13]. Both rule 
based and machine learning based approaches have been proposed and widely used 
in real life settings. In this paper, we extend the operator algebra of System T [15] 
to support rules based on spatial layout. 

Frameworks for Information Extraction: The NLP community has developed
several software architectures for sharing annotators, such as GATE [J and
UIMA [ 3 . The motivation is to provide a reusable framework where annotators
developed by different providers can be integrated and executed in a workflow.

Visual Information Extraction: There is a lot of work on using visual
information for specific tasks. We list some representative work below. The VIPS
algorithm described in [3] segments a DOM tree based on visual cues retrieved from
the browser’s rendering. The VIPS algorithm complements our work, as it can act as a
good preprocessing tool that performs task-independent page structure analysis before
the actual visual extraction takes place, thereby improving extraction accuracy. A
top-down approach to segment a web page and detect its content structure by dividing
and merging blocks is given in 0- i use visual information to build up a “M-tree”, a
concept similar to the DOM tree enhanced with screen coordinates. [B] describe a
completely domain-independent method for IE from web tables, using visual information
from the Mozilla browser. All these approaches are implemented as monolithic programs
that are meant for specific tasks. On the other hand, we are not targeting a specific
task; rather, our framework can be used for different tasks by allowing declarative
specification of both textual and visual extraction rules.

Another body of work that is somewhat related is automatic and semi-automatic
wrapper induction for information extraction [5]. These methods learn a template
expression for extracting information based on some training sets. The wrapper based
methods work well on pages that have been generated using a template, but do not
work well on human authored pages.

3 Visual Algebra 

3.1 Overview of Algebraic Information Extraction 

We start with a system proposed by Reiss et al. [15] and extend it to support visual
extraction rules. First, we give a quick summary of their algebra. For complete details,
we refer the reader to the original paper.

Data Model 

A document is considered to be a sequence of characters, ignoring its layout and other
visual information. The fundamental concept in the algebra is that of a span, an
ordered pair ⟨begin, end⟩ that denotes a region of text within a document identified
by its “begin” and “end” positions. Each annotator finds regions of the document
that satisfy a set of rules, and marks each region with an object called a span.

The algebra operates over a simple relational data model with three data types:
span, tuple, and relation. A tuple is a finite sequence of w spans ⟨s1, ..., sw⟩, where
w is the width of the tuple. A relation is a multiset of tuples, with the constraint that
every tuple in the relation must be of the same width. Each operator takes zero or
more relations as input and produces a single output relation.
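
The data model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Span, SpanTuple, Relation and check_relation are hypothetical, and we assume that the end offset is exclusive, which the summary above does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    """A region of text identified by character offsets (end exclusive, by assumption)."""
    begin: int
    end: int

    def text(self, document: str) -> str:
        return document[self.begin:self.end]


# A tuple is a fixed-width sequence of spans; a relation is a multiset
# (modelled here as a list) of tuples that all share the same width.
SpanTuple = Tuple[Span, ...]
Relation = List[SpanTuple]


def check_relation(relation: Relation) -> int:
    """Return the common tuple width, or raise if the widths differ."""
    widths = {len(t) for t in relation}
    if len(widths) > 1:
        raise ValueError(f"mixed tuple widths: {sorted(widths)}")
    return widths.pop() if widths else 0


doc = "Supported on Windows XP and AIX 5.3"
relation: Relation = [(Span(13, 23),), (Span(28, 35),)]
```

Here relation is a width-1 relation whose two spans cover "Windows XP" and "AIX 5.3" in doc.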

Operator Algebra

The set of operators in the algebra can be categorized broadly into relational operators,
span extraction operators, and span aggregation operators, as shown in Table 1.
Relational operators include the standard operators such as select, project, join, union,
etc. The span extraction operators identify segments of text that match some pattern
and produce spans corresponding to these matches. The two common span extraction
operators are the regular expression matcher ε_re and the dictionary matcher ε_d. The
regular expression matcher takes a regular expression r, matches it to the input text
and outputs spans corresponding to these matches. The dictionary matcher takes a
dictionary dict, consisting of a set of words/phrases, matches these to the input text and
outputs spans corresponding to each occurrence of a dictionary item in the input text.
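
The two span extraction operators can be illustrated with a minimal Python sketch. The function names regex_matcher and dictionary_matcher are hypothetical, spans are plain (begin, end) pairs with an exclusive end, and the case-insensitive dictionary lookup is our assumption (it also assumes that lower-casing does not change string length, which holds for ASCII text).

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def regex_matcher(pattern: str, text: str) -> List[Span]:
    """Return one span per non-overlapping match of the regular expression."""
    return [m.span() for m in re.finditer(pattern, text)]


def dictionary_matcher(entries: List[str], text: str) -> List[Span]:
    """Return one span per occurrence of any dictionary entry in the text."""
    spans = []
    lowered = text.lower()
    for entry in entries:
        needle = entry.lower()
        if not needle:
            continue  # ignore empty dictionary entries
        start = lowered.find(needle)
        while start != -1:
            spans.append((start, start + len(needle)))
            start = lowered.find(needle, start + 1)
    return sorted(spans)


text = "Windows XP, Windows Vista and AIX 5.3"
version_spans = regex_matcher(r"\d+\.\d+", text)
os_spans = dictionary_matcher(["Windows", "AIX"], text)
```

Each returned pair can be sliced back out of the text, for example text[30:33] for the "AIX" match.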


Operator class     Operators            Explanation
-----------------  -------------------  -----------------------------------------------
Relational         σ, π, ×, ∪, ∩, ...
Span extraction    ε_re, ε_d
Span aggregation   Ω_o, Ω_c, β
Predicates         s1 ⪯_d s2            s1 and s2 do not overlap, s1 precedes s2,
                                        and there are at most d characters between them
                   s1 ≃ s2              the spans overlap
                   s1 ⊂ s2              s1 is strictly contained within s2
                   s1 = s2              spans are identical

Table 1: Operator Algebra for Information Extraction


Operator class    Operator                      Explanation
----------------  ----------------------------  ---------------------------------------------
Span Generating   ℜ(d)                          Return all the visual spans for the document d
                  Ancestors(vs)                 Return all ancestor visual spans of vs
                  Descendants(vs)               Return all descendant visual spans of vs
Directional       NorthOf(vs1, vs2)             Span vs1 occurs above vs2 in the page layout
Predicate         StrictNorthOf(vs1, vs2)       Span vs1 occurs strictly above vs2 in the page
Containment       Contains(vs1, vs2)            vs1 is contained within vs2
Predicate         Touches(vs1, vs2)             vs1 touches vs2 on one of the four edges
                  Intersects(vs1, vs2)          vs1 and vs2 intersect
Generalization,   MaximalRegion(vs) /           Returns the largest/smallest visual span that
Specialization    MinimalRegion(vs)             contains vs and has the same text content as vs
Geometric         Area(vs)                      Returns the area corresponding to vs
                  Centroid(vs)                  Returns a visual span that has x and y
                                                coordinates corresponding to the centroid
                                                of vs and text span identical to vs
Grouping          (Horizontally/Vertically)     Returns groups of horizontally/vertically aligned
                  Aligned(VS, consecutive,      visual spans from VS. If the consecutive flag is
                  maxdist)                      set, the visual spans have to be consecutive with
                                                no non-aligned span in between. The maxdist
                                                limits the maximum distance possible between two
                                                consecutive visual spans in a group
Aggregation       MinimalSuperRegion(VS)        Returns the smallest visual span that contains all
                                                the visual spans in set VS
                  MinimalBoundingRegion(VS)     Returns a minimum bounding rectangle of all visual
                                                spans in set VS

Table 2: Visual Operators

The span aggregation operators take in a set of input spans and produce a set
of output spans by performing certain aggregate operations over the input spans.
There are two main types of aggregation operators - consolidation and block. The
consolidation operators are used to resolve overlapping matches of the same concept in
the text. Consolidation can be done using different rules. Containment consolidation
(Ω_c) is used to discard annotation spans that are wholly contained within other spans.
Overlap consolidation (Ω_o) is used to produce new spans by merging overlapping
spans. The block aggregation operator (β) identifies spans of text enclosing a minimum
number of input spans such that no two consecutive spans are more than a specified
distance apart. It is useful in combining a set of consecutive input spans into bigger
spans that represent aggregate concepts. The algebra also includes some new selection
predicates that apply only to spans, as shown in Table 1.
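
A minimal Python sketch of containment consolidation and block aggregation on text spans may make the description above more concrete. The function names are hypothetical, spans are (begin, end) pairs with an exclusive end, and the block function is a simplification that emits only maximal runs of nearby spans; the original operator may enumerate blocks differently.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def consolidate_containment(spans: List[Span]) -> List[Span]:
    """Discard spans that are wholly contained within another span."""
    unique = sorted(set(spans))
    return [
        s for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    ]


def block(spans: List[Span], max_gap: int, min_count: int) -> List[Span]:
    """Enclose each maximal run of at least min_count spans whose
    consecutive members are at most max_gap characters apart."""
    blocks: List[Span] = []
    run: List[Span] = []

    def flush() -> None:
        if run and len(run) >= min_count:
            blocks.append((run[0][0], max(end for _, end in run)))

    for span in sorted(spans):
        if run and span[0] - max(end for _, end in run) > max_gap:
            flush()
            run.clear()
        run.append(span)
    flush()
    return blocks


matches = [(0, 7), (0, 10), (12, 19), (30, 33)]
consolidated = consolidate_containment(matches)
blocks = block(consolidated, max_gap=5, min_count=2)
```

In this example (0, 7) is dropped because it lies inside (0, 10), and the first two surviving spans are close enough to form a single block.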

3.2 Extensions for Visual Information Extraction 

We extend the algebra described above in order to support information extraction based on
visual rules. In addition to the span, we add two new types to our model - Region
and VisualSpan. A Region represents a visual box in the layout of the page and has
the attributes ⟨xl, yl, xh, yh⟩. (xl, yl) and (xh, yh) denote the bounding box of the
identified region in the visual layout of the document. We assume that the regions are
rectangles, which applies to most markup languages such as HTML. A VisualSpan
is a combination of a text based span and a visual region with the following attributes:
⟨s, r⟩, where s is a text span having attributes begin and end as before and r is the
region corresponding to the span.
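
The two new types can be sketched as Python data classes. The class and method names below are illustrative only; area and is_within are hypothetical helpers that mirror the Area and Contains operators of Table 2, and we assume closed rectangles so that a region is within itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    begin: int
    end: int


@dataclass(frozen=True)
class Region:
    """An axis-aligned rectangle with corners (xl, yl) and (xh, yh)."""
    xl: float
    yl: float
    xh: float
    yh: float

    def area(self) -> float:
        return (self.xh - self.xl) * (self.yh - self.yl)

    def is_within(self, other: "Region") -> bool:
        """True if this region lies inside (or coincides with) other."""
        return (other.xl <= self.xl and other.yl <= self.yl
                and self.xh <= other.xh and self.yh <= other.yh)


@dataclass(frozen=True)
class VisualSpan:
    """A text span together with the region in which it is rendered."""
    span: Span
    region: Region


cell = VisualSpan(Span(17, 27), Region(110, 80, 290, 100))
table = VisualSpan(Span(0, 27), Region(100, 50, 400, 120))
```

With these definitions, cell.region.is_within(table.region) holds while the converse does not.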

The operators are also modified to work with visual spans. The relational operators
are unchanged. The span extraction operators are modified to return visual spans
rather than spans. For example, the regular expression operator ε_re matches the
regular expression r to the input text and, for each matching text span s, returns
its corresponding visual span. Similarly, the dictionary matcher ε_d outputs visual
spans corresponding to occurrences of dictionary items in the input text. The behavior
of the span aggregation operators (Ω_c and Ω_o) is also affected. Containment
consolidation Ω_c discards visual spans whose region and span are both contained in
the region and span of some other visual span. Overlap consolidation (Ω_o) aggregates
visual spans whose text spans overlap. It produces a new visual span whose text span is
the merge of the overlapping text spans and whose bounding box is the region corresponding
to the closest HTML element that contains the merged text span.

There are two flavors of the block aggregation operator (β). The text block operator
(β_s) is identical to the earlier β operator. It identifies spans of text enclosing a minimum
number of input spans such that no two consecutive spans are more than a specified
distance apart. The region block operator (β_v) takes as input an X distance x and a Y
distance y. It finds visual spans whose region contains a minimum number of input
visual spans that can be ordered such that the X distance between two consecutive
spans is less than x and the Y distance is less than y. The text span of the output
visual spans is the actual span of the text corresponding to its region.

The predicates described in Table 1 can still be applied to the text span part of
the visual spans. To compare the region part of the visual spans, we need many new
predicates, which are described in the next section.

3.3 Visual Operators 

We introduce many new operators in the algebra to enable writing of rules based on visual
regions. The operators can be classified as span generating, scalar or grouping operators,
and a subset is listed in Table 2. Many of these operators are borrowed from
spatial (GIS) databases. For example, the operators Contains, Touches and Intersects are
available in a GIS database like DB2 Spatial Extender^2. However, to the best of our
knowledge, this is the first application of these constructs to Information Extraction.

^2 http://www-01.ibm.com/software/data/spatial/db2spatial/

Span Generating Operators

These operators produce a set of visual spans as output and include the Ancestors
and Descendants operators.

Scalar Operators 

The scalar operators take as input one or more values from a single tuple and return a
single value. Boolean scalar operators can be used in predicates and are further classified
as directional or containment operators. The directional operators allow visual spans to
be compared based on their positions in the layout. Due to lack of space, we have listed
only NorthOf; we have similar predicates for the other directions. Other scalar
operators include the generalization/specialization operators and the geometric operators.

Grouping Operators 

The grouping operators are used to group multiple tuples based on some criteria and
apply an aggregation function to each group, similar to the GROUP BY functionality
in SQL.
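
The strict directional predicates can be sketched as plain Python functions over bounding boxes. The coordinate conditions below are inferred from the SQL conditions used for StrictSouthOf and StrictEastOf in Table 4 and should be read as an assumption rather than a formal definition; Box and the function names are hypothetical, and y is assumed to grow downwards as in screen coordinates.

```python
from typing import NamedTuple


class Box(NamedTuple):
    xl: float
    yl: float
    xh: float
    yh: float


def strict_south_of(a: Box, b: Box) -> bool:
    """a starts below the bottom of b and its horizontal extent lies inside b's."""
    return a.yl > b.yh and a.xl > b.xl and a.xh < b.xh


def strict_north_of(a: Box, b: Box) -> bool:
    """Mirror image of strict_south_of: a ends above the top of b."""
    return a.yh < b.yl and a.xl > b.xl and a.xh < b.xh


def strict_east_of(a: Box, b: Box) -> bool:
    """a starts right of b and its vertical extent lies inside b's."""
    return a.xl > b.xh and a.yl > b.yl and a.yh < b.yh


header = Box(100, 50, 300, 70)   # e.g. a cell reading "Operating Systems"
cell = Box(110, 80, 290, 100)    # a cell directly underneath the header
side = Box(310, 82, 400, 98)     # a cell to the right of that cell
```

For these boxes, cell is strictly south of header and side is strictly east of cell, while side is not strictly south of header because it extends beyond the header's right edge.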

3.4 Comparison with Source Based Approach 

If the visual algebra were not supported, we would have to implement a given task using
only source based rules. The visual algebra is a superset of the existing source based
algebra. Expressing a visual rule as a source based rule in the existing algebra falls
into one of the following cases:

1. Identical Semantics: Some of the visual operators can be mapped directly
into source level rules keeping the semantics intact. For example, the operator
Vertically Aligned can be mapped to an expression based on constructs in HTML
that are used for alignment such as <tr>, <li> or <p>, depending on the exact
task at hand.

2. Approximate Semantics: Mapping a visual rule to a source based rule with
identical semantics may lead to very complex rules since there are many ways to
achieve the same visual layout. It may be possible to get approximately similar
results by simplifying the rules if we know that the layout for the pages in the
dataset is achieved in one particular way. For example, in a particular template,
alignment may always be implemented using rows of a table (the <tr> tag), so
the source based rule can cover only this case.

3. Alternate Semantics: In some cases, it is not possible to obtain even similar
semantics from the source based rules. For example, rules based on Area,
Centroid, Contains, Touches and Intersects cannot be mapped to source
based rules, since it is not possible to check these conditions without actually
rendering the page. In such cases, we have to use alternate source based rules
for the same task.

4 System Architecture and Implementation 

This section describes the architecture and our implementation of the visual extraction
system. There are two models typically used for information extraction - document
level processing, in which rules are applied to one document at a time, and collection
level processing, in which the rules are matched against the entire document collection
at once. Document level processing is suitable in the scenario where the
document collection is dynamic and new documents are added over time. Collection
level processing is useful when the document collection is static and the rules are
dynamic, i.e. new rules are being developed on the same collection over time. Previous
work has demonstrated an order of magnitude improvement in performance from
collection level processing compared to document level processing with the use of indices
for evaluating regular expression rules [11]. The visual algebra can be implemented
using either a document level processing model or a collection level processing model.
We implemented a collection level processing approach using a relational database
with extensions for inverted indices on text for efficient query processing. Figure 2
depicts the overall system architecture. Collection level processing has two phases: (a) a
preprocessing phase comprising computations that can be done offline and (b) a query
phase that includes the online computations done during interactive query time.

Figure 2: System Architecture

Preprocessing Phase

In the preprocessing phase, web pages from which information is to be extracted are
crawled and a local repository of the web pages is created. Along with the HTML
source of the web page, all components that are required to render the page accurately,
such as embedded images and stylesheets, should also be downloaded and appropriately
linked from the local copy of the page. We use an open source Firefox extension called
WebPageDump^3, specifically designed for this purpose. Each page is then rendered
in a browser and for each node in the DOM tree, its visual region and text are extracted
(using the Chickenfoot Firefox extension^4) and stored in a relational database (IBM
DB2 UDB). We also use the indexing and text search capabilities of DB2 Net Search
Extender^5 to speed up queries that can benefit from an inverted index.

^3 http://www.dbai.tuwien.ac.at/user/pollak/webpagedump/

^4 http://groups.csail.mit.edu/uid/chickenfoot/

^5 http://www.ibm.com/software/data/db2/extenders/netsearch/

Query Phase 

During the interactive query phase, the user expresses the information extraction task
as operations in the visual algebra. The visual algebraic operations are then translated
to standard SQL queries and executed on the database.

4.1 Implementing Visual Algebra Queries using a Database 

Schema 

The visual regions computed in the pre-processing stage are stored in a table called
Regions with the following schema:

<Pageid, Regionid, xl, yl, xh, yh, TextStart, TextEnd,
Text, HtmlTag, MinimalRegion, MaximalRegion>.

The Pageid uniquely identifies a page. The HTML DOM tree is a hierarchical
structure where the higher level nodes comprise lower level nodes. For example, a <td>
may be nested inside a <tr> tag, which is nested inside a <table>, and so on. The
Regionid uniquely identifies a region in a page and is a path expression that encodes
the path to the corresponding node. This makes it easy to identify the ancestors and
descendants of a region. For example, a node 1.2 indicates a node reached by following the
second child of the first child of the root node. The xl, yl, xh, yh denote the coordinates
of the region. The Text field stores the text content of the node, with TextStart and
TextEnd indicating the offsets of the text within the document. The text content of
a higher level node is the union of the text content of all its children. However, to avoid
duplication, we associate only the innermost node with the text content while storing
in the Regions table. The MinimalRegion and MaximalRegion fields are used to
quickly identify a descendant or ancestor that has text content identical to this node.
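
As an illustration of this schema, the following sketch creates the Regions table in an in-memory SQLite database (standing in for DB2), stores text only on the innermost nodes, and reassembles the text of a higher level node from its descendants. The column types, the sample rows and the helper region_text are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Regions (
        Pageid INTEGER, Regionid TEXT,
        xl INTEGER, yl INTEGER, xh INTEGER, yh INTEGER,
        TextStart INTEGER, TextEnd INTEGER, Text TEXT, HtmlTag TEXT,
        MinimalRegion TEXT, MaximalRegion TEXT,
        PRIMARY KEY (Pageid, Regionid)
    )
""")

# Only innermost nodes (the <td> cells) carry text; Regionid is a dotted path.
rows = [
    (1, "1", 100, 50, 400, 100, 0, 28, None, "table", "1", "1"),
    (1, "1.1", 100, 50, 400, 70, 0, 17, None, "tr", "1.1.1", "1.1"),
    (1, "1.1.1", 100, 50, 400, 70, 0, 17, "Operating Systems", "td", "1.1.1", "1.1"),
    (1, "1.2", 100, 80, 400, 100, 18, 28, None, "tr", "1.2.1", "1.2"),
    (1, "1.2.1", 100, 80, 400, 100, 18, 28, "Windows XP", "td", "1.2.1", "1.2"),
]
conn.executemany("INSERT INTO Regions VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", rows)


def region_text(conn: sqlite3.Connection, page_id: int, region_id: str) -> str:
    """Concatenate the text of a node and its descendants in document order."""
    found = conn.execute(
        "SELECT Text FROM Regions "
        "WHERE Pageid = ? AND (Regionid = ? OR Regionid LIKE (? || '.%')) "
        "AND Text IS NOT NULL ORDER BY TextStart",
        (page_id, region_id, region_id),
    ).fetchall()
    return " ".join(text for (text,) in found)
```

Calling region_text(conn, 1, "1") collects the text of both cells, whereas region_text(conn, 1, "1.2") returns only the text of the second row.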

Implementation of Operators 

The visual algebra is implemented using a combination of standard SQL and User
Defined Functions (UDFs). Due to space constraints, we show the mapping of
only some representative operators, without going into complete detail, in Table 3. For
simplicity, we have shown the SQL for each operator separately. Applying these rules to
a general algebra expression will produce a nested SQL statement that can be flattened
out into a single SQL statement using the regular transformation rules for SQL sub-queries.
We also experimented with using a spatial database to implement our algebra, but found
that it was not very efficient. Spatial databases can handle complex geometries, but
are not optimized for the simple rectangular geometries that the visual regions have.
Conditions arising from simple rectangular geometries can be easily mapped to simple
conditions on the region coordinate columns in a regular relational database.

Visual span producing operators: The ε_re and ε_d operators are implemented
using UDFs that implement regular expression and dictionary matching respectively.
Ancestors(v) and Descendants(v) are implemented using the path expression in the
region id of v. Searching for all prefixes of the Regionid returns the ancestors and
searching for all extensions of the Regionid returns the descendants.
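
The prefix and extension searches on the path expression can be sketched as follows. The function names are hypothetical; the point of the sketch is that prefixes must be taken at component boundaries, so that region 1.2 is not mistaken for an ancestor of region 1.25.

```python
from typing import List


def ancestors(region_id: str) -> List[str]:
    """All proper prefixes of the dotted path, from the root downwards."""
    parts = region_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def is_descendant(candidate: str, region_id: str) -> bool:
    """True if candidate extends region_id by at least one path component."""
    return candidate.startswith(region_id + ".")


def descendants(region_id: str, all_ids: List[str]) -> List[str]:
    """All ids in all_ids that lie below region_id in the DOM tree."""
    return [rid for rid in all_ids if is_descendant(rid, region_id)]


ids = ["1", "1.2", "1.2.1", "1.25", "1.2.3.4"]
```

For example, ancestors("1.2.3") yields the paths "1" and "1.2", and descendants("1.2", ids) keeps "1.2.1" and "1.2.3.4" but not "1.25".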

Span aggregation operators: The span aggregation operators (Ω_o, Ω_c and β_v)
cannot be easily mapped to existing operators in SQL. We implement these in Java,
external to the database.

Other Visual operators: The scalar visual operators include the directional
predicates, containment predicates, generalization/specialization operators and
geometric operators. The predicates map to expressions in the WHERE clause. The
generalization/specialization predicates are implemented using the pre-computed values
in the columns MinimalRegion and MaximalRegion. The grouping operators map
to the GROUP BY clause in SQL and the aggregate functions can be mapped to SQL
aggregate functions in a straightforward way, as shown for Horizontally Aligned and
MinimalBoundingRegion.

Operator                     Mapping
---------------------------  ----------------------------------------------
ℜ(d)                         SELECT Pageid, Regionid FROM Regions
ε_re(exp)                    SELECT Pageid, Regionid FROM Regions R
                             WHERE MatchesRegex(R.Text, exp)
ε_d(dict)                    SELECT Pageid, Regionid FROM Regions R
                             WHERE MatchesDict(R.Text, dict)
Ancestors(v)                 SELECT Pageid, Regionid FROM Regions R
                             WHERE IsPrefix(R.Regionid, v.Regionid)
StrictNorthOf(v1, v2)        ... WHERE v1.yh < v2.yl AND
                             v1.xh < v2.xh ...
MinimalRegion(v)             SELECT Pageid, MinimalRegion FROM Regions R
                             WHERE R.Regionid = v.Regionid
Horizontally Aligned(R)      ... FROM R GROUP BY R.xl
MinimalBoundingRegion(V)     SELECT min(xl), min(yl), max(xh), max(yh)
                             FROM V

Table 3: Mapping to SQL
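
The GROUP BY mapping for the grouping and aggregation operators can be tried out on a toy relation. The sketch below uses an in-memory SQLite database in place of DB2; the table R, its sample rows and the HAVING threshold are made up for illustration. It groups regions that share the same xl coordinate, as in the mapping given in Table 3, and computes the minimum bounding rectangle of each group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE R (regionid TEXT, xl INTEGER, yl INTEGER, xh INTEGER, yh INTEGER)"
)
conn.executemany(
    "INSERT INTO R VALUES (?,?,?,?,?)",
    [
        ("1.1", 10, 100, 90, 120),   # three navigation links sharing a left edge
        ("1.2", 10, 130, 90, 150),
        ("1.3", 10, 160, 95, 180),
        ("2", 200, 100, 600, 400),   # main content block
    ],
)

# One row per group of at least two aligned regions, with its bounding rectangle.
groups = conn.execute(
    "SELECT xl, COUNT(*), MIN(xl), MIN(yl), MAX(xh), MAX(yh) "
    "FROM R GROUP BY xl HAVING COUNT(*) >= 2 ORDER BY xl"
).fetchall()
```

The single resulting group contains the three navigation links; the content block is filtered out by the HAVING clause because it forms a group of size one.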

Use of Indices 

Indices can be used to speed up the text and region predicates. Instead of the
MatchesRegex UDF, we can use the CONTAINS operation provided by the text
index. We also build indices on the xl, yl, xh, yh columns to speed up visual operators.
Once the visual algebra query is mapped to a SQL query, the optimizer performs the
task of deciding what indices to use for the query based on cost implications. An example
of a mapping is shown in Table 4.
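
The effect of coordinate indices can be explored with a small experiment. The sketch below again uses SQLite as a stand-in for DB2 (so the text index and its CONTAINS operation are not reproduced); the table contents and index names are invented for the example. It checks that a region predicate returns the same rows before and after the indices are created and then asks the engine for its query plan, whose exact wording depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Regions (Pageid INTEGER, Regionid TEXT, "
    "xl INTEGER, yl INTEGER, xh INTEGER, yh INTEGER)"
)
conn.executemany(
    "INSERT INTO Regions VALUES (?,?,?,?,?,?)",
    [(1, str(i), 10 * i, 20 * i, 10 * i + 50, 20 * i + 15) for i in range(1, 201)],
)

query = "SELECT Regionid FROM Regions WHERE yl > ? AND xh < ? ORDER BY Regionid"
params = (3000, 2500)

before = conn.execute(query, params).fetchall()

# One index per coordinate column; the optimizer decides which ones to use.
for column in ("xl", "yl", "xh", "yh"):
    conn.execute(f"CREATE INDEX idx_regions_{column} ON Regions ({column})")

after = conn.execute(query, params).fetchall()
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()

for step in plan:
    print(step[-1])  # human-readable description of each plan step
```

The indices change only how the rows are found, not which rows are returned, which is why the choice can be left to the optimizer.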

5 Experiments 

The goal of the experiments is two fold - to demonstrate the simplicity of visual queries 
and to study the effectiveness of mapping the visual algebra queries to database queries. 
We describe the visual algebra queries for a representative set of tasks, map them to 
SQL queries in a database system and study the effect of indexing on the performance. 

5.1 Experimental Setup 

The document corpus for our experiments consists of software product information pages
from the IBM web site^6. We crawled these pages, resulting in a corpus of 44726 pages. Our
goal was to extract the system requirements information for these products from their
web pages (see Figure 1). Extracting the system requirements is a challenging task since
the pages are manually created and don’t have a standard format. This task can be broken
into sub-tasks that we use as representative queries for our experiments. The queries are
listed below. The visual algebra expression and the equivalent SQL query over the
spatial database are listed in Table 4. For ease of expression, the visual algebra queries are
specified using a SQL like syntax. The functions RegEx and Dict represent the
operators ε_re and ε_d respectively. For each of these sub-tasks, it is possible to write more
precise queries. However, our goal here is to show how visual queries can be used for a
variety of extraction tasks without focusing too much on the precision and recall of these
queries.

^6 http://www.ibm.com/software/products/us/en?pgel=lnav

Visual Query (Q4):

select R3.VisualSpan
from RegEx('operating system', D) as R1,
     RegEx('windows', D) as R2,
     ℜ(D) as R3
where StrictSouthOf(R2, R1)
  and StrictEastOf(R3, R2)

SQL Query (Q4):

SELECT R3.pageid, R3.regionid
FROM regions R1, regions R2, regions R3
WHERE R1.pageid = R2.pageid
  AND R2.pageid = R3.pageid
  AND contains(R1.text, '"Operating Systems"') = 1
  AND contains(R2.text, '"Windows"') = 1
  AND R2.yl > R1.yh AND R2.xl > R1.xl
  AND R2.xh < R1.xh
  AND R3.xl > R2.xh AND R3.yl > R2.yl
  AND R3.yh < R2.yh

Table 4: Queries

• Filter the navigational bar at the left edge before extracting the system requirements.
Q1: Retrieve vertically aligned regions with more than n regions such that the
region bounding the group is contained within a virtual region A(xl, yl, xh, yh). For
our domain, we found that a virtual region of A(0, 90, 500, ∞) works well.

• Identify whether a page is a system requirements page. We use the heuristic that
system requirements pages have the term “system requirements” mentioned near
the top of the page.

Q2: Retrieve the region in the page containing the term “system requirements”
contained in a region A. In this case, we use a virtual region A(450, 0, ∞, 500).

• To identify the various operating systems that are supported, the following query can
be used.

Q3: Find all regions R such that R contains one of the operating systems mentioned
in a dictionary T and is to the strict south or to the strict east of a region containing
the term “Operating Systems”.

• To find the actual system requirements for a particular operating system, the
following query can be used.

Q4: Find a region that contains the term “Windows” that occurs to the strict south
of a region containing the term “Operating Systems” and extract a region to the
strict east of such a region.

Due to lack of space, we show the visual algebra expression and the equivalent
SQL query (Section 4.1) for only query Q4 in Table 4.
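
To make the SQL form of Q4 concrete, the following sketch runs it against a handful of hand-made regions in an in-memory SQLite database. This is only an approximation of our setup: SQLite stands in for DB2, the LIKE operator replaces the text index CONTAINS function, and the sample regions and their coordinates are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE regions (pageid INTEGER, regionid TEXT, text TEXT, "
    "xl INTEGER, yl INTEGER, xh INTEGER, yh INTEGER)"
)
conn.executemany(
    "INSERT INTO regions VALUES (?,?,?,?,?,?,?)",
    [
        (1, "1.1", "Operating Systems", 100, 50, 300, 70),
        (1, "1.2.1", "Windows XP", 110, 80, 290, 100),
        (1, "1.2.2", "512 MB RAM", 310, 82, 400, 98),
        (1, "1.3.1", "AIX 5.3", 110, 110, 290, 130),
        (1, "1.3.2", "1 GB RAM", 310, 112, 400, 128),
    ],
)

q4 = """
    SELECT R3.pageid, R3.regionid
    FROM regions R1, regions R2, regions R3
    WHERE R1.pageid = R2.pageid
      AND R2.pageid = R3.pageid
      AND R1.text LIKE '%Operating Systems%'
      AND R2.text LIKE '%Windows%'
      AND R2.yl > R1.yh AND R2.xl > R1.xl AND R2.xh < R1.xh  -- StrictSouthOf(R2, R1)
      AND R3.xl > R2.xh AND R3.yl > R2.yl AND R3.yh < R2.yh  -- StrictEastOf(R3, R2)
"""
result = conn.execute(q4).fetchall()
```

On this toy page the query selects only the requirements cell next to the Windows entry; the cell next to the AIX entry is rejected because it does not fall within the vertical extent of the Windows cell.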

5.2 Accuracy of Spatial Rules 

We measured the accuracy of our spatial rules using manually annotated data from
a subset of pages in our corpus. The test set for Q2 and Q4 consists of 116 manually
tagged pages. The test set for Q1 and Q3 contains 3310 regions from 10 pages with 525
positive examples for Q1 and 23 positive examples for Q3. Please note that for Q1 and
Q3 we need to manually tag each region in a page. Since there are a few hundred regions
in a page, we manually tagged only 10 pages. The rules were developed by looking
at different patterns that occur in a random sample of the entire corpus. The results
are reported in Figure 4. Since our tasks were well suited for extraction using spatial
rules, we were able to obtain a high level of accuracy using relatively simple rules.

5.3 Performance 

We measured the performance of these queries on the document collection. Since the
queries have selection predicates on the text column and the coordinates (xl, yl, xh, yh),
we build indices to speed them up. We also index the text column using DB2 Net
Search Extender. The running times for the queries are shown in Figure 3. We compare
the options of using no indices, using only the text index, and using both the text index and
indices on the region coordinates. For Q1, the text index does not make a difference
since there is no text predicate. The region index leads to a big improvement in the
time. Q2, Q3 and Q4 have both text and region predicates and thus benefit from the
text index as well as the region indices. The benefit of the text index is found to be
comparatively larger. In all the cases, we can see that using indices leads to a three
to fifteen times improvement in the query execution times.

Figure 3: Effect of Indices


Figure 4: Accuracy


6 Discussion 

We have demonstrated an extension to the traditional rule based IE framework that 
allows the user to specify layout based rules. This framework can be used for many 
information extraction tasks that require spatial analysis without having to use custom 
code. The WYSIWYE algebra we propose allows the user to seamlessly combine 
traditional text based rules with high level rules based on spatial layout. The visual 
algebra can be systematically mapped to SQL statements, thus enabling optimization
by the database. We have evaluated our system in terms of usability and performance 
for a task of extracting software system requirements from software web pages. The 
rules expressed using the visual algebra are much simpler than the corresponding source 
based rules and more robust to changes in the source code. The performance results 
show that by mapping the queries to SQL and using text and region indexes in the 
database, we can get significant improvement in the time required to apply the rules. 

Layout based rules are useful for certain types of pages, where the layout information
provides cues on the information to extract. A significant source of variation in web
pages (different source code, same visual layout) can be addressed by rule based
information extraction systems based on a visual algebra, leading to simpler rules.
Visual rules are not always a replacement for text based rules; rather, they are
complementary. In our system, we can write rules that combine both text based and
layout based rules in one general framework.